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Abstract.  The  problem  of  learning  to  mimic  a  human  expert /teacher 
from  training  trajectories  is  called  imitation  learning.  To  make  the  pro¬ 
cess  of  teaching  easier  in  this  setting,  we  propose  to  employ  transfer  learn¬ 
ing  (where  one  learns  on  a  source  problem  and  transfers  the  knowledge  to 
potentially  more  complex  target  problems).  We  consider  multi-relational 
environments  such  as  real-time  strategy  games  and  use  functional-gradient 
boosting  to  capture  and  transfer  the  models  learned  in  these  environ¬ 
ments.  Our  experiments  demonstrate  that  our  learner  learns  a  very  good 
initial  model  from  the  simple  scenario  and  effectively  transfers  the  knowl¬ 
edge  to  the  more  complex  scenario  thus  achieving  a  jump  start,  a  steeper 
learning  curve  and  a  higher  convergence  in  performance. 


1  Introduction 

It  is  common  knowledge  that  both  humans  and  animals  learn  new  skills  by 
observing  others.  This  problem,  which  is  called  imitation  learning ,  can  be  for¬ 
mulated  as  learning  a  representation  of  a  policy  -  a  mapping  from  states  to 
actions  -  from  examples  of  that  policy.  Imitation  learning  has  a  long  history 
in  machine  learning  and  has  been  studied  under  a  variety  of  names  including 
learning  by  observation  [1],  learning  from  demonstrations  [2],  programming  by 
demonstrations  [3],  programming  by  example  [4],  apprenticeship  learning  [5], 
behavioral  cloning  [6],  and  some  others.  Techniques  used  from  supervised  learn¬ 
ing  have  been  successful  for  imitation  learning  [7].  We  follow  this  tradition  and 
investigate  the  use  of  supervised  learning  methods  to  learn  behavioral  policies. 

Our  focus  is  on  relational  domains  where  states  are  naturally  described  by 
relations  among  an  indefinite  number  of  objects.  Examples  include  real  time 
strategy  games  such  as  Warcraft,  regulation  of  traffic  lights,  logistics,  and  a 
variety  of  planning  domains.  A  supervised  learning  method  for  imitation  learning 
was  recently  proposed  [8].  This  approach  assumes  an  efficient  hypothesis  space 
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Fig.  1.  Wargus  Scenarios  (left)  The  two  tower  scenario  where  providing  examples  is 
easier,  right  The  three  tower  scenario  which  is  significantly  more  complicated  and 
requires  more  training  trajectories. 


for  the  policy  function,  and  learns  only  policies  in  this  space  that  are  closest  to 
the  training  trajectories  [9,10].  This  approach  is  based  on  functional  gradient 
boosting  [11]  where  a  set  of  relational  regression  trees  [12]  are  used  to  compactly 
represent  a  complex  relational  policy.  This  approach  was  demonstrated  to  be 
successful  in  many  problems. 

One  of  the  key  assumptions  in  the  proposed  approach  is  that  the  policies  can 
be  generalized  across  the  objects  in  the  domain.  While  one  of  the  advantages 
of  a  logical  representation  is  the  generalization  capability,  it  is  also  quite  possi¬ 
ble  that  in  several  large  problems,  the  optimal  policies  can  vary  greatly  as  the 
number  of  the  objects  in  the  domains  can  increase.  In  such  cases,  the  learner 
has  to  be  provided  with  new  example  trajectories  to  learn  the  policies.  Since  the 
complexity  of  the  domain  has  increased,  the  number  of  trajectories  required  for 
learning  can  also  increase  significantly.  For  instance,  consider  the  two  scenarios 
presented  in  Figure  1  where  the  goal  is  to  defend  the  towers  from  being  destroyed 
by  the  enemy  units.  In  the  left  figure,  there  are  two  towers  and  two  enemy  and 
friendly  footmen  and  archers.  In  the  right  figure,  all  the  numbers  increase  by 
one.  As  we  show  empirically,  the  optimal  policies  for  the  two  scenarios  can  be 
very  different.  More  importantly,  the  number  of  trajectories  required  to  converge 
to  the  optimal  policy  is  higher  in  the  case  of  the  more  complex  scenario. 

In  order  to  train  on  such  complex  scenarios,  we  propose  to  employ  transfer 
learning  [13, 14]  for  learning  in  a  (simpler)  source  problem  and  then  transfer¬ 
ring  the  learned  knowledge  to  a  (more  complex)  target  task.  More  precisely, 
we  aim  to  employ  transfer  by  initialization  [15]  where  the  models  learned  from 
the  source  task  are  used  to  initialize  the  models  in  the  learning  task.  Following 
prior  work  [8],  we  perform  search  through  the  space  of  policies  using  functional 
gradient  boosting  but  initialize  the  gradients  with  the  models  learned  in  the 
source  task.  Our  hypothesis  is  that  this  initialization  will  allow  the  learner  to 
explore  more  complex  policy  spaces  that  might  not  have  been  accessed  easily  if 
the  search  started  out  with  uniform  policies.  We  verify  this  claim  empirically. 
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In  summary,  we  consider  the  problem  of  imitation  learning  in  relational  do¬ 
mains  where  the  optimal  policies  can  be  significantly  different  as  the  number  of 
objects  in  the  domain  increases.  Generalization  of  policies  is  still  a  desired  prop¬ 
erty  as  the  properties  of  the  objects  themselves  can  change  across  situations  with 
the  same  number  of  objects.  When  the  number  of  objects  change,  we  propose  to 
employ  transfer  learning  by  initialization  to  initialize  the  gradients  in  the  target 
task.  We  evaluate  the  hypothesis  in  a  real  time  strategy  game  and  show  that  we 
are  able  to  achieve  a  jump  start,  faster  convergence  to  a  more  optimal  policy. 

The  rest  of  the  paper  is  organized  as  follows:  we  introduce  the  background 
and  the  prior  work  on  relational  imitation  learning  next.  We  then  present  our 
transfer  algorithm  for  initialization  and  evaluate  the  algorithm  on  a  complex 
RTS  game  and  conclude  the  paper  by  outlining  some  challenges  for  future  work. 

2  Background 

An  MDP  is  described  by  a  set  of  discrete  states  S,  a  set  of  actions  A,  a  reward 
function  rs(a)  that  describes  the  expected  immediate  reward  of  action  a  in  state 
s,  and  a  state  transition  function  p^s,  that  describes  the  transition  probability 
from  state  s  to  state  s'  under  action  a.  A  policy,  7 r,  is  defined  as  a  mapping 
from  states  to  actions,  and  specifies  what  action  to  execute  in  each  state.  In  the 
imitation  learning,  we  assume  that  the  reward  function  is  not  directly  obtained 
from  the  environment.  Our  input  consists  of  S,  A  and  supervised  trajectories 
generated  by  a  Markov  policy.  We  try  to  match  it  using  a  parameterized  policy. 

3  Relational  Imitation  Learning 

Following  Ratliff  et  al.  [16],  we  assume  that  the  discount  factor  are  absorbed  into 
the  transition  probabilities  and  policies  are  described  by  p  £  G  where  G  is  the 
space  of  all  state-action  frequency  counts.  We  assume  a  set  of  features  F  that 
describe  the  state  space  of  the  MDP  and  the  expert  chooses  the  action  ai  at  any 
time  step  i  based  on  the  set  of  feature  values  {ft)  according  to  some  function. 
For  simplicity,  we  denote  the  set  of  features  at  any  particular  time  step  i  of  the 
jth  trajectory  as  f;  and  we  drop  j  whenever  it  is  fairly  clear  from  the  context. 

The  goal  of  our  algorithm  is  to  learn  a  policy  that  suitably  mimics  the  expert. 
More  formally,  we  assume  a  set  of  training  instances  {(f; ,  }"=1  that  is 

provided  by  the  expert.  Given  these  training  instances,  the  goal  is  to  learn  a 
policy  /i  that  is  a  mapping  from  fj1  to  aj  for  each  set  of  features  f(.  The  key 
aspect  of  our  setting  is  that  the  individual  features  are  relational  i.e. ,  objects 
and  relationships  over  these  objects.  The  features  are  denoted  in  standard  logic 
notation  where  p(X)  denotes  the  predicate  p  whose  argument  is  X .  The  problem 
of  imitation  learning  given  these  relational  features  and  expert  trajectories  can 
now  be  posed  as  a  regression  problem  or  a  supervised  learning  problem  over 
these  trajectories. 

In  our  previous  work  [8],  we  employed  Functional-Gradient  Boosting  for 
learning  relational  policies.  The  goal  is  to  find  a  policy  p  that  is  captured  using 
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the  trajectories  (i.e.,  features  f;  and  actions  aj)  provided  by  the  expert,  i.e. , 
the  goal  is  to  determine  a  policy  /z  =P(ai|fi;V’)  Va,i  where  the  features  are 
relational.  These  features  could  define  the  objects  in  the  domain  (squares  in  a 
gridworld,  players  in  robocup,  blocks  in  blocksworld,  archers  or  footmen  in  a 
real-time  strategy  game  etc.),  their  relationships  (type  of  objects,  teammates 
in  robocup  etc.),  or  temporal  relationships  (between  current  state  and  previous 
state)  or  some  information  about  the  world  (traffic  density  at  a  signal,  distance 
to  the  goal  etc.). 

We  assume  a  functional  parametrization  over  the  policy  and  consider  the 
conditional  distribution  over  actions  a*  given  the  features  to  be, 

P(ai |fi;V0  =  e^a^lYjeHa'iA\Val  G  A  (1) 


where  ^(a*;  fi)  is  the  potential  function  of  at  given  the  grounding  f)  of  the 
feature  predicates  at  state  Si  and  the  normalization  is  over  all  the  admissible 
actions  in  the  current  state.  Formally,  functional  gradient  ascent  starts  with  an 
initial  potential  ipo  and  iteratively  adds  gradients  A;.  Here,  Am  is  the  functional 
gradient  at  episode  m  and  is 


Am  — T]m  x  EX  y  \d / dipm—ilog  P{y\x]  VVn— 1)]  (2) 

where  rjm  is  the  learning  rate.  Note  that  in  Equation  2,  the  expectation  EXtV[..\ 
cannot  be  computed  as  the  joint  distribution  P(x,  y)  is  unknown  (in  our  case, 
y' s  are  the  actions  while  x’s  are  the  features).  Instead  of  computing  the  gradi¬ 
ents  over  the  potential  function,  the  gradients  are  computed  for  each  training 
example: 

=  V^^^/o5(P(a^|fi);^))|v;m_1  (3) 

i  i 

These  are  point-wise  gradients  for  examples  (fj1 ,  aj)  on  each  state  i  in  each  trajec¬ 
tory  j  conditioned  on  the  potential  from  the  previous  iteration(shown  as  |  ^m_1)- 
Now  this  set  of  local  gradients  form  a  set  of  training  examples  for  the  gradient  at 
stage  m.  The  main  idea  in  the  gradient-tree  boosting  is  to  fit  a  regression-tree  on 
the  training  examples  at  each  gradient  step  [17].  The  idea  of  functional  gradient 
boosting  is  presented  in  Figure  2. 

The  functional-gradient  w.r.t  ip(aj;ij)  of  the  likelihood  for  each  example 
(fj1,  aj)  is  given  by: 


d log  P(aj\{(;  ip) 
dip{aj;f() 


I(<4  =  ~  P(<4 I^VO 


(4) 


where  aj  is  the  action  observed  from  the  trajectory  and  /  is  the  indicator  func¬ 
tion  that  is  1  if  aj  =  aj  and  0  otherwise.  The  key  feature  of  the  above  expression 
is  that  the  functional-gradient  at  each  state  of  the  trajectory  is  dependent  on 
the  observed  action  a.  If  the  example  is  positive  (i.e.,  it  is  an  action  executed 
by  the  expert),  the  gradient  (/  —  P)  is  positive  indicating  that  the  policy  should 


Final  Model ; 


>0  +  .£VX\MJ 


Fig.  2.  Relational  FGB.  This  is  similar  to  the  standard  FGB  where  trees  are  induced 
in  stage-wise  manner;  the  key  difference  being  that  the  trees  are  relational  regression 
trees.  To  compute  the  predictions,  a  query,  x  is  applied  to  each  tree  in  turn,  and  the 
numerical  values  at  the  leaf  reached  in  each  tree  are  summed  to  obtain  4’ix)- 


increase  the  probability  of  choosing  the  action.  On  the  contrary  if  the  example  is 
a  negative  example  (i.e. ,  for  all  other  actions),  the  gradient  is  negative  implying 
that  it  will  push  the  probability  of  choosing  the  action  towards  0. 

Following  prior  work  [18-20],  we  used  Relational  Regression  Trees  (RRTs)[12] 
to  fit  the  gradient  function  at  every  feature  in  the  training  example  [8].  Hence 
the  distribution  over  each  action  is  represented  as  a  set  of  RRTs  on  the  features. 
These  trees  are  learned  such  that  at  each  iteration  the  new  set  of  RRTs  aim  to 
maximize  the  likelihood.  Hence,  when  computing  P(a(X)\  f(x))  for  a  particular 
value  of  state  variable  X  (say  x),  each  branch  in  each  tree  is  considered  to 
determine  the  branches  that  are  satisfied  for  that  particular  grounding  ( x )  and 
their  corresponding  regression  values  are  added  to  the  potential.  For  example, 
A'  could  be  a  particular  unit  in  Wargus.  or  a  certain  block  in  the  blocksworld. 


4  Relational  Transfer 


In  this  work,  we  extend  the  previous  work  in  imitation  learning  by  employing 
the  ideas  for  inductive  transfer  [13].  While  the  previous  approach  was  able  to 
achieve  generalization  in  a  imitation  learning  setting,  the  generalized  policies 
might  not  be  sufficient  in  some  other  variations  of  the  problems.  For  instance, 
in  the  scenarios  considered  in  Figure  1,  the  optimal  policies  for  the  three  tower 
defense  scenario  can  be  significantly  different  from  the  easier  task  of  two  tower 
scenario.  Also,  since  the  three  tower  case  is  a  harder  task,  as  we  show  empirically, 
learning  in  this  setting  might  require  more  example  trajectories  from  the  expert. 
In  such  cases,  it  is  easier  to  transfer  the  knowledge  gained  from  the  two  tower 
scenario  to  the  three  tower  case  for  initialization  and  then  improve  upon  the 
knowledge  by  learning  in  the  three  tower  case.  This  will  enable  the  learner  to: 
(a)  learn  a  better  policy  than  the  one  generalized  from  the  two  tower  scenario 
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Table  1.  Transfer  Learning  Algorithm 
function  Transfer (Tso^r-ce , Ttarget ) 

As  =  TIL({ } ,TSource')  >  Learn  with  source  Trajectories 

At  =  TIL (As, Ttarget)  >  Use  learned  models  and  learn  on  Target  Trajectories 

return  At 
end  function 

function  TIL(A,  Trajectories  T) 

A0  =  A 

for  1  <  k  <  |  Aj  do  t>  Iterate  through  each  action 

for  1  <  m  <  M  do  >  M  gradient  steps 

Sk  :=  GenExamples(k;T;  Aljn_1) 

Am(k)  :=  FitRRT(Sk’,  L)  o  Gradient 

A™  :=  Am-i  +  Am{k)  >  Update  models 

end  for 

P(A  =  fc|f)  oci/jk 

end  for 
return  A 
end  function 

function  GenExamples(1c,  T,  A) 

S:=0 

for  1  <  j  <  |T|  do  >  Trajectories 

for  1  <  *  <  | S'1 1  do  >  States  of  trajectory 

Compute  P(aJi  =  fc| fj )  >  Probability  of  user  action  being  the  current 

action 

4(*;  f})  =  =  k)-  P(a{  =  fc|fJ) 

S  :=  S  U  [(fil ,  f9 ) ,  A(aj ;  f9 ))]  >  Update  relational  regression  examples 

end  for 
end  for 

return  S  >  Return  regression  examples 

end  function 


and  (b)  converge  to  the  optimal  policy  faster  in  the  three  tower  scenario  i.e. , 
from  fewer  trajectories  when  compared  to  learning  with  no  knowledge. 

The  form  of  the  functional  gradients  facilitate  easy  transfer.  Since  they  per¬ 
form  gradient  descent  in  function  space,  we  can  initialize  the  models  (i/> o  )  for 
the  three  tower  scenario  with  some  of  the  trees  learned  in  the  two  tower  sce¬ 
nario.  Conceptually,  this  is  essentially  the  same  as  using  the  result  of  the  first 
few  gradient  steps  in  the  source  problem  while  learning  in  the  target  problem. 
After  initializing  the  gradient  ascent  with  the  initial  set  of  trees,  we  propose  to 
learn  new  set  of  trees  in  the  target  task  that  build  upon  the  initial  model.  As 
mentioned  earlier,  this  initial  set  of  trees  for  the  i()q. 

To  perform  learning  in  the  target  task,  the  trajectories  must  be  weighted 
given  the  initial  model.  Similar  to  Equation  4,  we  compute  the  value  of  /(a^  = 
a^|f;)  —  P(a^ |f;;^o)  f°r  each  action  of  each  trajectory,  i.e.,  the  weight  of  each 
observed  action  is  the  difference  between  the  indicator  function  of  that  action 
and  the  marginal  probability  of  that  action  given  the  initial  potential  function. 
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Initial  Model:  2  Tower  Examples  Refined:  3  Tower  Examples 


Source  Task 


Source  Task 


+ 


+ 


Target  Task  Target  Task 


Fig.  3.  Proposed  transfer  approach.  The  key  idea  is  to  learn  a  small  set  of  trees  from 
the  source  task  and  use  them  to  initialize  the  RFGB  algorithm  for  the  harder  task. 


Once  these  weights  are  computed  for  the  given  trajectory,  they  serve  as  the 
examples  for  learning  new  set  of  trees.  This  idea  is  presented  in  Figure  3.  First 
we  learn  a  few  set  of  trees  in  the  source  task  and  then  use  them  to  initialize  the 
models  in  the  target  task.  We  then  learn  a  new  set  of  trees  in  the  target  task. 

Our  proposed  transfer  learning  algorithm  is  presented  in  Algorithm  1.  The 
function  Transfer  is  the  main  algorithm  that  takes  as  input  trajectories  from 
the  source  and  the  target  tasks.  The  algorithm  first  calls  the  TIL  function  (which 
stands  for  Tree  —  basedlmitationLearning)  with  an  empty  potential  function 
(empty  set  of  trees)  and  the  source  trajectories.  The  TIL  function  then  learns  a 
set  of  trees  in  the  source  task.  The  TIL  function  is  the  same  one  presented  in  [8] 
with  the  modification  that  it  can  use  an  initial  set  of  trees.  For  each  action  (fc), 
it  generates  the  examples  for  our  regression  tree  learner  (called  using  function 
FitRRT)  to  get  the  new  regression  tree  and  updates  its  model  (A^J.  This  is 
repeated  up  to  a  pre-set  number  of  iterations  M  (typically,  M  =  20).  We  found 
empirically  that  increasing  M  has  no  effect  on  the  performance  as  the  example 
weights  nearly  become  0  and  the  regression  values  in  the  leaves  are  close  to  0  as 
well.  Note  that  the  after  rn  steps,  the  current  model  will  have  m  regression 
trees  each  of  which  approximates  the  corresponding  gradient  for  the  action  k. 
These  regression  trees  serve  as  the  individual  components  ( Am(k ))  of  the  final 
potential  function. 
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Once  the  set  of  trees  have  been  learned  in  the  source  task,  a  subset  of  those 
trees  (typically  we  use  20  trees  in  our  experiments),  is  then  used  as  the  initial 
model  for  the  target  task  and  the  TIL  function  is  called  with  this  initial  set  and 
the  trajectories.  The  function  then  returns  a  new  set  of  trees  which  are  then  used 
for  evaluating  in  the  target  task.  It  must  be  mentioned  that  when  choosing  to 
act  in  the  target  task,  inference  over  the  actions  is  performed  using  all  the  trees 
(the  initial  set  of  source  trees  plus  the  target  trees).  It  is  easy  to  see  that  we 
cannot  ignore  the  transferred  trees  since  they  form  the  first  step  of  the  gradient 
ascent  when  learning  the  policy  in  the  target  domain. 

The  proposed  approach  is  closely  related  to  the  idea  of  modular  policies  of 
Driessens  [21].  He  observed  that  the  use  of  functional  gradients  to  represent 
policies  allows  us  to  separate  the  gradient  updates  to  different  subtasks  of  the 
agent’s  task.  Thus,  we  can  create  separate  potential  functions  for  each  part  of 
the  task  and  the  natural  addition  operator  of  functional  gradients  allows  then  to 
obtain  the  final  policy  which  is  essentially  a  sum  of  different  regression  trees.  We 
extend  the  above  to  transfer  learning  where  we  consider  an  initial  set  of  trees 
for  a  different  task  (or  potentially  a  subtask)  and  a  new  set  of  trees  are  then 
learned  for  the  new  task.  Hence,  combining  these  two  ideas,  it  is  possible  to  learn 
a  higher  level  policy  in  a  hierarchy  by  transferring  from  the  lower  level  subtasks. 

5  Experiments 

We  present  the  empirical  evaluation  of  our  proposed  algorithm  on  a  real-time 
strategy  game.  We  are  particularly  interested  in  the  following  questions: 

Ql:  How  do  the  transferred  models  compare  against  the  models  that  are  general¬ 
ized  using  the  relational  imitation  learning  algorithm? 

Q2:  How  do  the  transferred  models  compare  against  the  models  that  are  learned 
directly  on  the  target  task  with  no  prior  models  from  source  task? 

Experimental  Setup:  Stratagus  is  an  open-source  real-time  strategy  (RTS)  game 
engine  written  in  C  based  off  the  Warcraft  series  of  games.  Like  all  RTS  games,  it 
allows  multiple  agents  to  be  controlled  simultaneously  in  a  fully  observed  setting, 
making  an  ideal  test  bed  for  imitation  learning.  A  java  client  was  written,  revised 
at  Oregon  State  University6,  to  connect  to  the  Stratagus  game  engine  via  a  socket 
connection.  The  client  collects  all  of  the  game  information  from  the  game  engine 
and  can  issue  detailed  commands  to  all  units  of  a  player  in  the  game.  This  client 
allows  for  the  learned  policies  to  be  executed  directly  in  the  game  environment 
as  opposed  to  simulation  creating  more  realistic  performance  metrics. 

The  setting  in  which  transfer  is  being  tested  is  the  tower  defense  scenario 
shown  in  Figure  1.  The  map  used  for  the  experiments  consisted  of  6  x  6  grid 
world.  Our  scenarios  consist  of  two  opposing  teams-one  attacking,  one  defending- 
each  with  two  kinds  of  units.  Footman  have  more  health  but  must  be  close  to  an 
enemy  to  attack  them  while  ranged  archers  are  easily  killed  but  can  attack  from 
a  distance.  Towers  exist  on  the  map  in  set  locations.  The  defending  team  must 


http:  / /beaversource. oregonstate.edu/projects/stratagusai 
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prevent  the  attacking  team  from  destroying  the  towers  while  the  attacking  team 
must  destroy  as  many  towers  as  they  can.  The  defending  team  must  divide  its 
units  among  the  various  towers  to  prevent  one  tower  from  falling  while  another 
is  being  saved.  This  dynamic  creates  complex  policies. 

We  used  the  following  fea¬ 
tures  to  describe  the  state: 
the  strength  (high,  medium, 
low)  and  location  of  all 
friendly  units,  the  type  (foot¬ 
man,  archer)  of  all  units  in 
the  game,  which  tower  each 
enemy  unit  is  currently  at¬ 
tacking,  and  the  enemy  unit 
that  all  friendly  units  were 
attacking  the  previous  state. 
Friendly  units  are  unaware  of 
the  strength  of  enemy  units 
or  their  exact  location.  The 
full  set  of  information  given 
at  every  state  is  included  in 
Table  4.  The  actions  available 
to  the  friendly  units  are  to 
move  to  a  location  and  attack 
a  particular  unit.  The  nature  of  the  objects  and  the  relationships  between  the 
objects  in  this  game  naturally  allow  for  a  relational  representation.  Each  type  of 
unit  (footman  or  archers)  shares  traits  such  as  their  attacking  range  and  their 
total  health  so  certain  policy  rules  will  naturally  apply  to  all  units  of  that  type. 

As  mentioned  earlier,  the  goal  of  this  experiment  is  to  learn  to  protect  two 
towers  in  a  source  scenario  and  transfer  the  learned  knowledge  to  a  target  sce¬ 
nario.  The  attacking  team’s  strategy  is  as  follows:  At  the  start  of  each  game,  each 
member  of  the  attacking  team  randomly  selects  a  tower  to  attack.  However,  if 
approached  by  an  enemy  archer  or  footmen,  they  will  change  their  target  to 
eliminate  the  opposing  player’s  offensive  units.  After  destroying  one  tower,  they 
will  randomly  select  another  tower  to  attack  until  there  are  no  towers  left  and 
the  game  ends.  The  goal  of  the  game  is  to  defend  the  towers.  The  game  ends  if 
either  the  enemy  team  manages  to  destroy  all  of  the  towers  or  the  friendly  team 
kills  all  of  the  enemy  units.  The  number  of  friendly  and  attacking  units  vary 
between  the  source  and  target  scenarios.  In  the  source  scenario,  there  are  two 
footmen  and  two  archers  while  in  the  target  scenario,  there  are  three  footmen 
and  three  archers.  In  both  the  scenarios,  the  towers  cannot  defend  themselves. 

Results:  We  used  two  performance  metrics  that  are  based  on  the  number  of 
towers  saved.  The  3-tower  win  percentage  is  the  percentage  of  games  in  which 
all  three  towers  were  saved.  This  is  a  difficult  task  because  it  requires  the  friendly 
team  to  defend  all  three  towers  simultaneously;  else  one  tower  may  fall  while  they 
are  all  defending  the  others.  The  1-tower  win  percentage  is  the  easier  metric, 


Predicates 

Description 

friendlyobject 

enemyobject 

dead 

locationld 

strength 

distance 

enemyattower 

attacking 

The  type  of  defending  unit 

The  type  of  attacking  unit 
Enemy  unit  that  is  dead 
Location  of  friendly  unit 
Strength  (hit  points)  of  friendly  unit 
Distance  of  a  friendly  unit  to  a  tower 
Tower  that  enemy  unit  is  attacking 
Enemy  unit  that  a  friendly  unit 
was  attacking  in  the  previous  state 

Fig.  4.  Features  that  describe  the  state  in  the  two 
scenarios.  We  omit  the  arguments  of  the  predicates 
for  brevity. 
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Fig.  5.  Results  of  saving  at  least  one  tower  (left)  and  saving  all  three  towers(right). 
The  transferred  models  dominate  the  one  learned  from  two  towers  (generalization)  and 
the  one  learned  on  three  towers  without  an  initial  policy  (non-transfer) 


only  requiring  1  of  the  3  towers  to  be  saved.  We  used  randomly  selected  samples 
of  30,  50,  70,  100,  and  150  3-tower  games  from  a  pool  of  approximately  1000 
expert  games  for  training  in  the  target  scenario.  For  the  source  scenario,  we  used 
100  games  for  training.  This  experiment  was  repeated  10  times  and  the  average 
percentage  of  winnings  games  were  computed. 

We  used  three  models  for  evaluation  -  the  transfer  model,  the  non-transferred 
model  which  learns  only  on  the  target  task  and  the  generalization  model  which 
learns  only  on  the  source  and  not  the  target  task.  We  learned  20  trees  on  the 
source  task  and  learned  a  further  20  on  the  target  task.  Hence,  the  final  transfer 
model  had  40  trees  while  the  other  two  models  had  20  trees  each. 

The  results  are  presented  in  Figures  5  corresponding  to  saving  at  least  one 
tower  and  three  towers  respectively.  As  can  be  seen  from  the  figures,  the  transfer 
model  is  superior  to  both  the  generalization  and  the  pure  imitation  learner  in  the 
twin  tasks.  Also,  the  results  follow  the  original  transfer  learning  goals  of  jump 
start,  steeper  learning  curve  and  better  convergence.  It  is  fairly  clear  that  in 
defending  at  least  one  tower,  the  use  of  the  initial  models  from  the  source  scenario 
provides  a  bigger  jump  start  than  in  defending  all  towers.  On  the  other  hand, 
when  saving  all  towers,  the  jump  start  is  not  fairly  high  but  the  learning  curve 
is  steeper.  It  appears  that  the  use  of  initial  models  allow  for  the  learner  in  the 
target  scenario  to  explore  a  space  of  policies  that  might  not  have  been  otherwise 
reached  from  an  uniform  policy.  Similarly,  the  difference  between  transfer  and 
non-transfer  models  is  higher  in  the  one  tower  case  than  saving  all  the  towers 
though  the  difference  in  statistically  significant  in  both  the  cases.  Also,  it  must 
be  mentioned  that  even  when  the  non-transferred  models  were  provided  with  a 
lot  more  trajectories,  it  is  not  able  to  match  the  performance  of  the  transferred 
model.  This  suggests  that  using  initial  policies  in  some  cases  is  more  useful  than 
obtaining  more  expert  trajectories.  It  would  be  interesting  to  evaluate  these 
results  in  other  domains  as  well. 

In  summary,  our  experiments  answer  both  the  questions  affirmatively  in  that 
the  transfer  models  dominate  both  the  original  imitation  learning  models. 
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6  Discussion  and  Conclusion 

We  address  the  issue  of  sample  complexity  in  imitation  learning  settings.  In  sce¬ 
narios  where  the  expert’s  time  is  expensive/valuable  and  we  have  access  to  only 
a  few  training  examples  from  the  expert  our  approach  is  to  divide  the  expert’s 
time  between  simple  (smaller  domain  size)  and  harder  (larger  domain  size)  prob¬ 
lems.  Although  policies  induced  from  the  simpler  problem  training  instances  can 
be  employed  to  solve  the  larger  domain  via  relational  generalization,  in  scenarios 
we  provide  (such  as  Wargus)  this  does  not  translate  to  better  performance.  We 
have  presented  transfer  learning  by  using  the  simpler  policy  as  our  initial  models 
and  building  an  updatable  relational  model  by  learning  from  the  harder  exam¬ 
ples.  We  observe  not  only  a  superior  performance  to  generalization  but  also  a 
drastic  reduction  in  the  sample  complexity  as  compared  with  the  naive  method 
of  directly  inducing  a  model  on  the  complex  examples. 

Imitation  learning  encounters  two  major  problems  when  dealing  with  large 
state  spaces.  First,  assuming  a  tabular  representation  of  the  policy  to  be  learned 
is  likely  to  exceed  memory  due  to  the  large  state  and  action  space.  Second,  it  can 
only  make  use  of  a  limited  amount  of  expert  traces  compared  to  the  excessive 
amount  of  possible  traces.  The  implicit  feedback  gained  by  the  expert’s  traces 
on  the  best  action  to  take  in  a  state  might  be  so  sparse  that  a  well-generalizing 
policy  will  only  be  discovered  slowly. 

The  first  problem  can  be  solved  using  relational  imitation  learning  (RIL)  for 
structural  domains.  However,  the  problem  of  sparse  feedback  has  not  been  ad¬ 
dressed  by  RIL  yet.  For  relational  reinforcement  learning,  there  is  a  compelling 
and  simple  solution  to  this  problem:  inject  traces  of  execution  of  a  reasonable 
policy  for  the  task  at  hand  [22].  Unfortunately,  this  does  not  work  for  imitation 
learning.  The  input  consists  already  of  traces  of  execution  of  a  reasonable  policy, 
namely  the  policy  of  the  expert.  Thus,  we  do  not  gain  anything  despite  enlarg¬ 
ing  the  training  set  as  can  be  seen  from  our  results.  The  non-transfer  model 
seemed  to  have  converged  to  a  inferior  policy.  To  overcome  this  problem,  we 
intuitively  propose  to  inject  traces  of  a  policy  of  a  reasonably  well  related  task. 
Specifically,  we  directly  inject  the  complete  ’’related”  policy  into  a  functional 
gradient  boosting  approach  to  RIL.  This  appears  to  be  an  interesting  result  in 
that  sometimes  prior  policies  have  a  better  impact  on  the  performance  compared 
to  more  trajectories.  Our  immediate  challenge  is  to  validate  this  hypothesis  on 
other  more  complex  domains.  Another  interesting  direction  is  the  possibility 
of  employing  active  learning  methods  for  extracting  the  best  complex  examples 
given  the  initial  model  thereby  further  improving  on  the  performance  of  transfer. 
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